Task 1: Representation Techniques for Complex Documents

Task  Objectives 

In  this  task,  the  goal  is  to  extend  the  word-based  representations  that  are  common  in 
retrieval  systems  in  order  to  support  summarization,  browsing,  and  more  effective 
retrieval.  Specifically,  we  will  be  studying  phrase-based  representations  and  relationships 
between  phrases  in  individual  and  groups  of  documents  as  the  basis  for  our  approach. 
Document  structure  will  be  used  as  part  of  the  information  that  is  used  to  "tag"  the  phrasal 
representation. 

Technical  Problems 

The  technical  problems  have  to  do  with  defining  a  "phrase",  developing  techniques  for 
rapidly  extracting  them  from  text,  comparing  phrase  contexts  to  identify  significant 
relationships,  producing  summaries  from  these  representations,  and  extending  the 
underlying  retrieval  model  to  be  able  to  make  effective  use  of  phrasal  representations  in 
both  query-based  retrieval  and  relevance  feedback. 

General  Methodology 

The  general  methodology  for  this  task  is  to  demonstrate  effectiveness  through  user-based 
and  collection-based  experiments.  Extensive  use  will  be  made  of  the  TIPSTER  document 
collection,  which  consists  of  a  large  number  of  text  documents  from  a  variety  of  sources, 
queries,  and  user  relevance  judgments  for  each  query.  This  collection  will  be  used  for  the 
experiments  involving  new  probabilistic  retrieval  models  and  relevance  feedback. 
Summarization  techniques  will  be  compared  to  sentence-based  approaches  and  user-based 
evaluations of these summaries will be done. As more work is done on summarization in
the  TIPSTER  program,  we  will  make  use  of  any  new  evaluation  measures  developed  there. 
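
As an illustration of a collection-based experiment, the sketch below scores one ranked result list against relevance judgments using average precision. The report does not say which effectiveness measures were used, so this choice and the names `average_precision`, `ranking`, and `judged_relevant` are assumptions.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of judged-relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


# Hypothetical ranking and judgments for a single query.
ranking = ["d3", "d1", "d7", "d2"]
judged_relevant = {"d1", "d2", "d9"}
ap = average_precision(ranking, judged_relevant)
```

Averaging this value over all queries in a collection gives a single figure for comparing two retrieval configurations.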

Technical  Results 

We  have  continued  to  refine  the  techniques  for  identifying  phrases  for  indexing  in  large 
corpora.  Based  on  our  analysis  of  results  from  the  TREC  corpus,  patterns  that  occur  within 
fixed  window  sizes  are  not  superior  to  patterns  made  up  of  consecutive  words.  In 
addition,  rules  for  pruning  the  patterns  and  special  classes  of  patterns  (such  as  quantities) 
have  been  identified.  An  initial  approach  to  incorporating  large  phrase  lists  in  the  indexing 
process  has  also  been  developed.  We  have  also  begun  to  look  at  phrase  clustering  as  a 
technique  for  summarizing  document  collections  and  large  documents.  Global  measures  of 
significance  that  go  beyond  simple  frequency  counts  are  also  being  considered. 
Experiments  on  selecting  the  core  concepts  in  a  query  have  produced  mixed  results.  In 
discussions  with  PTO,  we  have  found  that  patent  classification  is  an  area  of  significant 
interest  and  we  have  begun  to  develop  classification  research  tasks. 
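
The two kinds of pattern compared above can be sketched as follows: phrases made up of consecutive words, and word pairs that co-occur within a fixed window. This is a minimal illustration, not the project's extraction code; the function names, the stopword list, and the pruning rule are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def consecutive_phrases(tokens, n=2):
    """Count patterns made up of n consecutive words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def windowed_pairs(tokens, window=4):
    """Count ordered word pairs that co-occur within a span of `window` tokens."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + window]:
            counts[(left, right)] += 1
    return counts


def prune(counts, min_count=2, stopwords=frozenset({"the", "of", "a", "in", "for"})):
    """Drop rare patterns and patterns that begin or end with a stopword (assumed rule)."""
    return {p: c for p, c in counts.items()
            if c >= min_count and p[0] not in stopwords and p[-1] not in stopwords}


text = ("Relevance feedback improves retrieval. "
        "Relevance feedback uses judged documents for retrieval.")
tokens = tokenize(text)
bigrams = prune(consecutive_phrases(tokens))
pairs = windowed_pairs(tokens)
```

The windowed variant produces many more candidate patterns than the consecutive-word variant for the same text, which is one reason pruning rules matter at corpus scale.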

Important Findings and Conclusions


It  is  too  early  in  the  project  to  claim  any  conclusive  results.  The  TREC  results  showed  no 
significant  improvement  obtained  from  the  technique  we  were  using  to  identify  core  terms, 
although  this  was  not  the  case  in  earlier  experiments.  We  are  currently  working  on 
improving  these  techniques. 

Significant  Hardware  Development 

Disk storage was purchased to hold large collections of patent data.

Special  Comments 

We  continue  to  work  with  the  PTO,  San  Diego  Supercomputer  consortium  and  DARPA  to 
obtain  access  to  some  of  the  patent  collections  and  establish  fast  network  links  in  order  to 
be  able  to  use  the  very  large  archives  of  scanned  patents. 

Implication  for  Further  Research 

The  key  to  more  experiments  is  to  obtain  patent  data.  We  plan  to  refine  the  initial  research 
goals  based  on  feedback  from  PTO  and  present  these  in  the  next  status  report. 


Task  2:  Browsing  and  Discovery  Techniques  for  Document  Collections 
Task  Objectives 

The  goals  of  this  task  are  to  develop  techniques  for  summarizing  collections  of  documents, 
and  discovering  connections  between  important  ideas  and  documents  in  distributed 
collections.  These  techniques  will  be  designed  to  support  interactive  browsing  in 
environments  like  the  PTO. 

Technical  Problems 

The  technical  problems  involve  producing  an  effective  summary  of  a  group  of  documents, 
such  as  a  retrieved  set  or  an  entire  database.  Both  document  and  phrase  clusters  could  be 
used  as  part  of  this  process.  In  order  to  support  discovery,  connections  must  be  made 
between documents and groups of phrases that use a variety of evidence in addition to
direct  co-occurrence. 
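
One form of evidence beyond direct co-occurrence is similarity of the contexts in which two phrases appear. The sketch below builds a bag of nearby words for each phrase and compares the bags by cosine similarity. It is an assumed, simplified stand-in for whatever comparison the project uses; the documents, phrases, and window size are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(docs, phrases, window=3):
    """Map each phrase to a bag of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for phrase in phrases:
            parts = phrase.split()
            n = len(parts)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == parts:
                    left = tokens[max(0, i - window):i]
                    right = tokens[i + n:i + n + window]
                    vectors[phrase].update(left + right)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical documents: the first two phrases never share a document.
docs = [
    "the patent claims a disk drive with a magnetic head",
    "the invention claims a hard disk with a magnetic head",
    "the trademark shows a red logo on white background",
]
vecs = context_vectors(docs, ["disk drive", "hard disk", "red logo"])
sim_related = cosine(vecs["disk drive"], vecs["hard disk"])
sim_unrelated = cosine(vecs["disk drive"], vecs["red logo"])
```

Here "disk drive" and "hard disk" never occur in the same document, yet their contexts are far more similar to each other than to that of "red logo". In practice, common function words would be down-weighted before comparison.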

General  Methodology 

The  techniques  will  be  evaluated  with  user-based  and  collection-based  experiments.  The 
relevance  judgments  from  the  TIPSTER  collection  will  be  used  to  evaluate  clusters  of 
documents.  Phrase  clusters  will  be  evaluated  by  their  impact  on  retrieval  effectiveness  and 
through user experiments that will measure performance on specific tasks. Part of the effort
in  this  task  (and  the  previous  one)  will  involve  developing  a  PTO  test  collection,  which 
means  that  sample  queries  will  need  to  be  gathered  from  patent  examiners  and  they  will 
need to evaluate demonstrations of tools as they are developed.

Technical  Results 

We  have  continued  to  develop  a  3-D  graphics  interface  designed  for  manipulating  document 
and  concept  relationships  to  identify  strong  groupings  and  relationships.  This  work  is 
being evaluated by comparing the visualization to known relevant groups based on TREC
data.  We  have  also  started  work  on  phrase  clusters  as  summaries  of  retrieved  sets  and 
document  collections. 

Important  Findings  and  Conclusions 

Our  experiments  with  the  visualization  tool  have  not  yet  produced  significant  results,  but 
new  approaches  have  shown  some  promise. 

Significant  Hardware  Development 

None 

Special Comments

Implication for Further Research

We  will  continue  to  work  on  visualization  and  summarization  techniques  that  are  relevant  to 
tasks in the PTO. Elements such as the PTO classification structure will play more of a role
in  future  developments. 


Task  3:  Scanned  Document  Indexing  and  Retrieval 
Task  Objectives 

The  goals  of  this  task  are  to  develop  techniques  for  detecting  text,  trademarks,  logos,  and 
images  in  scanned  documents,  clean  up  backgrounds  of  these  detected  objects,  and  support 
retrieval  of  images  (such  as  designs  in  design  patents),  trademarks,  and  text  from  OCR. 

Technical  Problems 

Current  zoning  techniques  available  with  commercial  OCR  devices  do  not  accurately  locate 
text or trademarks within other images. We are developing techniques based on Gaussian
derivative  filters  to  both  detect  and  clean  up  (remove  noisy  backgrounds)  these  classes  of 
objects  in  scanned  documents.  We  are  developing  "appearance-based"  retrieval  of  images 
as well as more straightforward features such as color and texture. Filter-based and
frequency  domain  based  techniques  offer  some  potential  in  this  area,  but  significant  work 
needs  to  be  done  on  making  this  approach  efficient  enough  to  deal  with  thousands  of 
images. 
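
To make the filtering idea concrete, the sketch below convolves a single image row with a sampled first derivative of a Gaussian. The response is near zero over flat background and peaks at intensity edges such as character strokes. This is a one-dimensional illustration under assumed parameters (`sigma`, `radius`), not the project's detector, which would operate on two-dimensional images.

```python
import math


def gaussian_derivative_kernel(sigma=1.0, radius=3):
    """Sampled first derivative of a 1-D Gaussian: -x / sigma^2 * G(x)."""
    kernel = []
    for x in range(-radius, radius + 1):
        g = math.exp(-(x * x) / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))
        kernel.append(-x / (sigma * sigma) * g)
    return kernel


def filter_row(row, kernel):
    """Convolve one image row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            x = k - radius
            j = min(max(i - x, 0), len(row) - 1)  # convolution uses row[i - x]
            acc += weight * row[j]
        out.append(acc)
    return out


# Hypothetical scan line: dark background followed by a bright region.
row = [0.0] * 8 + [1.0] * 8
response = filter_row(row, gaussian_derivative_kernel())
edge_index = max(range(len(response)), key=lambda i: abs(response[i]))
```

Thresholding the magnitude of such responses is one simple way to separate high-contrast strokes from a slowly varying background.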

General  Methodology 

The  evaluation  of  these  techniques  will  be  done  in  a  similar  way  to  text  by  developing  test 
collections  of  images  and  scanned  documents.  Specifically,  we  are  working  to  obtain  large 
collections  of  trademarks  and  design  patents,  as  well  as  typical  queries. 

Technical  Results 

We have obtained initial results from indexing images and performing retrieval tests using a
set  of  image-based  queries.  These  results  have  shown  that  retrieval  speed  has  improved  by 
more  than  a  factor  of  20  while  retrieval  effectiveness  has  been  maintained.  We  have  also 
been  developing  test  databases  for  text  detection  and  initial  evaluations  of  these  techniques 
look promising. We have begun to study retrieval of design patents, but this is still at an
early  stage. 

Important  Findings  and  Conclusions 

Recent results indicate that our techniques can provide effective and efficient retrieval of
some  types  of  images.  We  must  now  apply  these  techniques  to  design  patents  and 
trademarks, and continue to work on indexing techniques that will support massive scale-up
of  image  database  size. 

Significant  Hardware  Development 

Disk  acquisition. 

Special  Comments 

Gaining  access  to  PTO  design  patents  and  trademarks  has  been  a  priority. 

Implication  for  Further  Research 

We  plan  to  have  more  meetings  with  patent  examiners  who  deal  with  design  patents  and 
trademarks.  As  soon  as  enough  test  data  is  obtained,  we  will  begin  work  on  customizing 
the  techniques  for  the  types  of  drawings  found  in  design  patents. 


Task  4:  Distributed  Retrieval  Architecture 
Task  Objectives 

The  goals  of  this  task  are  to  scale  up  our  current  methods  of  automatically  selecting 
collections  and  merging  results,  and  to  investigate  architectures  that  can  support  efficient 
retrieval,  browsing  and  relevance  feedback  in  distributed  environments  with  terabytes  of 
information. 

Technical  Problems 

The current INQUERY text retrieval system uses a client-server architecture to support
simultaneous  retrieval  from  multiple  collections  distributed  across  one  or  more  processors. 
A  number  of  efficiency  bottlenecks  develop,  however,  when  the  size  of  the  databases  is 
very  large.  Deciding  which  subcollections  to  search  can  address  part  of  the  problem,  but 
there  are  other  problems  associated  with  the  fundamental  efficiency  of  the  processes 
involved  and  the  use  of  distributed  resources.  Image  indexing  and  retrieval  tends  to  make 
all  of  these  problems  worse  since  the  databases  and  indexes  are  considerably  larger. 
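
The two steps named in the task objectives, selecting subcollections and merging their results, can be sketched as below. This is not INQUERY's algorithm. It assumes a simple document-frequency score for choosing collections and min-max score normalization for merging, and all collection names and statistics are hypothetical.

```python
import heapq


def rank_collections(query_terms, collection_stats):
    """Order collections by the summed fraction of documents containing each query term."""
    scores = {}
    for name, stats in collection_stats.items():
        n_docs = stats["num_docs"]
        scores[name] = sum(stats["df"].get(t, 0) / n_docs for t in query_terms)
    return sorted(scores, key=scores.get, reverse=True)


def merge_results(result_lists, k=5):
    """Merge per-collection ranked lists after min-max normalizing their scores."""
    merged = []
    for name, results in result_lists.items():
        if not results:
            continue
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid dividing by zero for single-score lists
        merged.extend(((s - lo) / span, name, doc) for doc, s in results)
    return heapq.nlargest(k, merged)


# Hypothetical per-collection statistics (document counts and term document frequencies).
stats = {
    "patents_a": {"num_docs": 1000, "df": {"disk": 200, "drive": 150}},
    "patents_b": {"num_docs": 500, "df": {"disk": 5}},
    "trademarks": {"num_docs": 800, "df": {}},
}
order = rank_collections(["disk", "drive"], stats)

# Hypothetical raw scores returned by the two selected collections.
results = {
    "patents_a": [("a1", 12.0), ("a2", 8.0), ("a3", 4.0)],
    "patents_b": [("b1", 0.9), ("b2", 0.3)],
}
top = merge_results(results, k=3)
top_docs = [doc for _, _, doc in top]
```

Normalization is needed because raw scores from different collections are not directly comparable. Searching only the top-ranked collections trades some recall for lower response time.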

General  Methodology 

The  architectures  and  algorithms  produced  in  this  task  will  be  evaluated  using  a 
combination  of  standard  performance  (efficiency)  measures  and  effectiveness  measures. 
The  efficiency  tests  will  be  done  using  large  PTO  databases,  including  images,  and  the 
collection  selection  algorithms  will  be  evaluated  using  the  text  subcollections  of  the  patents. 

Technical Results


We  have  evaluated  the  multi-threaded  version  of  INQUERY  and  found  little  performance 
difference  compared  to  a  client-server  version.  Experiments  continue  in  this  area  and  we 
have  also  been  evaluating  INQUERY's  performance  on  large,  distributed  collections.  This 
evaluation  has  shown  that  there  are  significant  performance  issues  that  need  to  be  addressed 
in order to maintain adequate response times, although query optimization and phrase-based
indexing  are  expected  to  have  a  major  impact. 

Important Findings and Conclusions

Initial  results  suggest  no  advantage  to  multi-threaded  implementations  in  large,  distributed 
environments. 

Significant  Hardware  Development 

Disk storage was purchased to hold large collections of patent data.

Special  Comments 

Previous  comments  on  fast  network  access  to  other  PTO  sites  are  particularly  relevant  here, 
since  this  will  be  required  to  both  test  the  distributed  architecture  and  to  index  and  retrieve 
the full versions of the PTO databases.

Implications  for  Further  Research 

We  are  still  planning  to  ramp  up  the  experiments  on  collection  selection  and  result 
merging  as  well  as  continuing  performance  experiments. 